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Abstract 

We introduce a new shrinkage variable selection operator for linear models 
which we term the adaptive ridge selector (ARiS). This approach is inspired by 
the relevance vector machine (RVM), which uses a Bayesian hierarchical linear 
setup to do variable selection and model estimation. Extending the RVM algo- 
rithm, we include a proper prior distribution for the precisions of the regression 
coefficients, vj 1 ~ /(vj 1 ^]), where rj is a scalar hyperparameter. A novel fit- 
ting approach which utilizes the full set of posterior conditional distributions 
is applied to maximize the joint posterior distribution p((3, a 2 , v" 1 ^, n) given 
the value of the hyper-parameter n. An empirical Bayes method is proposed 
for choosing i]. This approach is contrasted with other regularized least squares 
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estimators including the lasso, its variants, nonnegative garrote and ordinary 
ridge regression. Performance differences are explored for various simulated 
data examples. Results indicate superior prediction and model selection accu- 
racy under sparse setups and drastic improvement in accuracy of model choice 
with increasing sample size. 

KEYWORDS: Lasso; Elastic net; Shrinkage estimation; RVM; Ridge Regression; 
LARS algorithm; Penalized Least Squares. 

1 Introduction 

Consider the familiar linear regression model, y = X/3+£ where y is an n-dimensional 
vector of responses, X is the nxp dimensional model matrix and e is an n-dimensional 
vector of independent noise variables which are normally distributed, M p (0, cr 2 I p ) 
with variance a 2 . Let some arbitrary subset of the regression coefficients, {3, be 
zero meaning that the corresponding regressors do not contribute to the response 
in the underlying model. The ordinary least squares (OLS) estimate is obtained by 
minimizing the squared error loss. Although an unbiased estimator in this setting, the 
OLS estimator may have large variance and will incorrectly estimate the coefficients 
that are zero in the underlying model. As a consequence, the estimator will usually 
result in overly complex models which may be difficult to interpret. Conventionally, 
an analyst will use subset selection to arrive at a reduced model, which is easier to 
interpret and attains better prediction accuracy. The subset selection problem has 
been studied extensively (George, 2000). In cases with large numbers of variables, 
such methods suffer a fundamental limitation, the need for a greedy algorithm to 
search the discrete model space. 

A more direct approach to improve the prediction accuracy of OLS is based on 
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"shrinkage" estimators (Berger, 1985) which trade off increased estimator bias in re- 
turn for a compensating decrease in variance leading to a smaller mean squared error 
(MSE). A number of methods have been proposed that realize variable selection and 
estimation via some penalized least squares criteria, e.g. the nonnegative garotte 
(Breiman, 1995), the lasso (Tibshirani, 1996), the elastic net (Zou and Hastie, 2005), 
the adaptive lasso (Zou, 2006) and more recently the Dantzig selector (Candes and 
Tao, 2007), etc. While traditional ridge regression (Hoerl and Kennard, 2000) pro- 
poses an £ 2 -penalty on the coefficients, the lasso, and its variants make use of the 
^i-penalty. Tibshirani (1996) demonstrates that the £i-polytope, unlike £2, can touch 
the contours of the least squares objective function on one or more of the axes leading 
to estimates of zero for the associated regression coefficients. 

Tipping (2001) synthesized several current ideas in the Machine Learning litera- 
ture and offered two important hybrid algorithms for model selection and fitting in the 
kernel regression context. In particular, by combining kernel transformations of inde- 
pendent variables with classical elements of Bayesian hierarchical modeling (Lindley 
and Smith, 1972) he created the relevance vector machine (RVM). This approach is 
able to efficiently reduce the huge feature space created by the kernel transformations 
to a very parsimonious and predictive set improving significantly upon the less sta- 
tistical support vector regression method of Drucker et al. (1997). This continuous 
approach typically converges in a finite number of steps and provides very fast model 
selection for large number of variables without the concerns of Monte-Carlo noise or 
incomplete optimization typical of subset selection. A number of extensions to the 
RVM have been offered; see Tipping and Lawrence (2005); D'Souza et al. (2004). 

Recognizing the RVM as a Bayesian random effects model, we offer an alternative 
formulation which offers a more complete hierarchical structure. One major new de- 
velopment exploits this hierarchical structure to rapidly fit the model given a fixed 
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value for the key hyper-parameter. Unlike Tipping (2001), we do not integrate the re- 
gression coefficients out of the joint posterior distribution of the parameters. Instead, 
we use a conditional maximization procedure (Lindley and Smith, 1972) to obtain 
the posterior mode in an elegant and efficient way. This approach to model fitting 
was criticized by (Harville, 1977, Sec 8.3) in the context of standard linear mixed 
effects models due to the fact that it can lead to estimators of variance components 
that are identically zero and necessarily far from the posterior mean. Contrary to 
Harville's conclusion, we show that the zeroing effect is theoretically justified and can 
be easily exploited for variable selection. We refer to this procedure as the adaptive 
ridge selector (ARiS). Like the lasso, this results in a sparse shrinkage estimator which 
will zero irrelevant coefficients. The marginal likelihood p(y\rj) is maximized over the 
hyper-parameter t] in order to select the best estimator. This final empirical Bayes 
step adjusts the amount of shrinkage imposed on the model. Like RVM, this algo- 
rithm typically converges in a finite number of steps and provides rapid and effective 
model selection and fitting for models with very large numbers of variables. 

Section 2 introduces a hierarchical random effects model similar to that of Tipping 
(2001) and motivates our interest. Section 3 explains how the hierarchical structure 
is exploited to efficiently fit the model and describes the steps in the proposed ARiS 
algorithm. It also analyzes the problem in the form of a regularized least squares 
problem and contrasts the estimator with one based on the marginal distribution of 
the regression coefficients. Next, Section 4 focuses on deriving the marginal likelihood 
which is used to determine the optimal model. Because the marginal likelihood is 
analytically intractable, one must compute it through either analytical or simulation 
based approximation. We provide a Laplace approximation which is evaluated at the 
posterior mode along with a simulation based technique which results in somewhat 
more accurate solutions. Section 5 presents comparisons of the proposed method 
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along with alternatives on simulated data examples. Conclusions are discussed in 
Section 6. 

2 ARiS 

Beginning with a standard hierarchical linear model (Sorensen and Gianola, 2002, 
Section 6.2.2) we propose a basic modification. In this case, the joint posterior of the 
parameters is proportional to, 

p(/3, v- 1 , a 2 |y, H) cx p(y|/3, a 2 )p((3\a 2 , v" 1 )^" 1 V )p(a 2 ). (1) 

Here a normal likelihood is assumed, p(y\(3, a 2 ) ~ 7V(X/3, cr 2 I), along with a conjugate 
normal prior on the regression coefficients, /3, and a typical Jeffreys prior on the error 
variance a 2 , 

pC/V.v- 1 ) ~ Af(0,a 2 V) (2) 
p(a 2 ) cx I/a 2 . (3) 

As with the relevance vector machine of Tipping (2001), the vector v^ 1 = diagl^V^ 1 ) 
where V is a diagonal matrix with elements Vj, j = 1, ...,p the reciprocals of which 
are independent and identically distributed from a gamma distribution, 

u r ' +1 

P{ - V i 1} = T(ri+l) Vj " 6XP ^ (4) 

where \x is the inverse scale parameter, and rj is the shape parameter. By definition 
Vj > 0, \i > and r/ > — 1. Notice that when 77 = 0, this becomes an exponential 
distribution which we will consider ClS db special case. 
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Tipping (2001) sets rj = — 1 and fi = which leads to a scale invariant improper 
prior. He then derives a marginal likelihood p (y|cr 2 , t>jf x , ...jf" 1 ) through direct in- 
tegration which is then maximized with respect to a 2 and vj . Hypothetically as 
the algorithm proceeds some VjS will tend toward which correspond to the irrele- 
vant variables in the model. Tipping (2001) does not check the validity of the joint 
posterior density having assumed an improper prior on vj . 

In contrast to Tipping (2001), we choose not to integrate the regression coefficients 
out of the joint posterior distribution, but instead proceed to find the modal value 
given the data and rj. Here we fix fi to be a very small number (e.g. machine epsilon) 
and adjust r] to control shrinkage. 

Sparsity is obtained via the combination of these two particular priors, p(j3j\a 2 , v J ) 
and p(vj ). Integration over vj in the joint prior distribution p(Pj,vJ 1 ) will reveal 
that the marginal prior density of the regression coefficients is a product of univariate 
t densities, with a scale of a/ (fj,a 2 )/(rj + 1) and degrees of freedom of 2rj + 2. It is 
important to note that the product of univariate t-densities is not equivalent to a 
multivariate t and does not have elliptical contours but instead produces ridges along 
the axes. These ridges can be made more drastic by choosing the scale parameter 
to be small; see Figure 1. Then the posterior will be maximized where ever these 
ridges first touch the contours of the likelihood. The parameter rj plays a very similar 
role to the regularization parameter of the ridge regression, lasso, etc., with larger 
values encouraging further shrinkage. Hence the proposed hierarchical structure im- 
plies independent t priors being placed on each regression coefficient. Direct use of 
such t priors would obscure the conjugate nature of the model. From an optimization 
perspective, a direct use of such priors leads to a non-convex objective function which 
would not be desirable. As we will discuss, within the hierarchical structure each 
iteration solves a simpler convex problem leading to an efficient solution. 
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Once Vj 1 s are integrated out of the joint posterior, the problem can be seen 
analogously gularized least squares problem as \x — > which solves 

n p 

min iVi - **/3) 2 + A ]T log(#). (5) 

i=i j=i 

Having specified a complete hierarchical model, the joint posterior distribution of 
the parameters is obtained by the product of the likelihood and the specified priors 
up to a normalizing constant as 

p(l3,a 2 ,v- l \y,H) oc p(y|/3, a 2 )p{(3\a 2 , v- l )p{a 2 )p{v- x \H) 

oc (a 2 )-^ +1 ) flvT^-y^Tir) + 1)"* 

(y - X/3)'(y - X/3) + 



x exp 



2a 2 



x exp \-^J2 v j( ( 6 ) 

i=i 



where H = (rj, /x). 



Theorem 1. Given the priors in (2), (3), (4), the product of these prior densities 
and the normal likelihood, (6) is the kernel of a posterior density function for (3, a 2 , 
and v _1 . 

Proof for this theorem can be found in Appendix A. The conditional distributions 
of the parameters can now easily be derived from this joint distribution. 

i. The regression coefficients are distributed as multivariate normal conditional on 
the error variance a 2 and the prior covariance of the regression coefficients V. 

/3|aV-\y~AT p (AV-V), (7) 
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where f3 = (X'X + V" 1 )" 1 X'y and V = X'X + V" 1 . 

ii. The error variance is distributed as inverse gamma conditional upon all other 
parameters. 



p(<r 2 \{3,v-\y) oc (a 2 r ( ^ +1) 

(y-X/3)'(y-X/3)+/3'V-V3 



x exp 



2a 2 



Thus, 

o" 2 1/3, v" 1 , y ~ inverse — gamma (u*, A*) , (9) 
where u* = (n + p) /2 and A* = fr-^'fr-^+gv-ifl 

iii. The prior precisions of the regression coefficients, conditional on all other pa- 
rameters, follow a gamma distribution. 

pCv- 1 ! A a 2 , y, H) oc f[(vj^ exp ± f'V } 

oc n(^)(f+^)exp [ J A^vJ^ . (10) 



Thus, 

v]' 1 \Pj,o :2 ,y,H ~ gamma (r}*,^) , (11) 
where 77* = 3/2 + 77 and /i* = + 2a 2 /i)/2(j 2 . 

Deriving the full set of conditional distributions has several uses. As is frequently 
done, we may utilize these to simulate from the posterior distribution using Gibbs 
sampling. Such an approach would allow us to compute traditional Bayes estimators 
for the regression coefficients. In Section 3 we show how to use the conditional 



distributions to maximize the joint posterior in a surprisingly simple and effective 
way. Maximization will also facilitate computation of the Laplace approximation to 
the marginal likelihood (Tierney and Kadane, 1986). 



3 Computing Posterior Modes 

Lindley and Smith (1972) proposed an optimization algorithm to find the joint pos- 
terior modes; see also Chen et al. (2001). Once the fully conditional densities of 
the model parameters are obtained, it is possible to maximize the joint posterior 
distribution by iteratively maximizing these conditional densities. 

Since the conditional posterior distributions obtained in equations (7), (9), and 
(11) are well-known distributions with readily available modes, the Lindley-Smith 
optimization algorithm becomes rather appealing to implement. The modes for the 
distributions in equations (7), (9), and (11) respectively are 



(3 = (X'X + V- 1 ) x X'y (12) 

a 2 = -A- (13) 
(3 2 + 2a 2 fi , 

«i - (frw i L2 » < 14 > 

where u*, X* were defined in (9). The maximization proceeds through sequential re- 
estimation of a 2<yl \ Vj \ where / = 1, ...,m, m is the number of iterations, and 
(3^ is the OLS estimator. 



3.1 Relation to Regularized LS 

Given V, the modal value of (3 can be obtained as a solution to a penalized least 
squares problem as with ridge regression. Since we have an iterative procedure, let 
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Vj be the jth diagonal of at the Ith iteration and be given. Then the Zth iterate 
for j3 is the solution to a similar penalized least squares problem: 

= arg min ]T ( Vi - x,/3) 2 + ^ ^L. (15) 

i=l j=l V j 

If we substitute v 3 - with the estimate from (14) and let fi — > 0, we obtain 

n P n2 

/3« = an? min V (y, - x,/3) 2 + (1 + 2t?) V (16) 

1=1 j=l ^3 



where dp = + y df l ^ /a 2 ®- This procedure is essentially re- weighting the predictor 
variables by the positive square root of the ratio between the current estimate of the 
coefficients and the residual variance due to them. After this re-weighting procedure, 
the problem takes the form of a standard ridge regression, 



n . . 2 P 



= arg min £ (y, - x^V) + (1 + H j>* 2 , (17) 



i=l j=l 



where (3* = fy/ujj, x* = x*" ^u;*^ and x*^ = Xj. The solution to the problem 
above at iteration / is given by 

0*(O = ( X '*«X*W + (1 + 2t?)I) _1 X'*Wy- (18) 

Hence the mode is computed through a sequence of re-weighted ridge estimators. 
The final estimate then can be recovered as (3*^ x ^Q™! ujf\ YliLi J 
(this multiplication is understood component wise). Note that when rj = —1/2, this 
procedure results in the OLS estimator. 

We construct a two-dimensional example to illustrate the method . Consider 
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the model yi = (3\Xn + fti%x-i + A/"(0, 1), where Pi = and (3 2 = 3. We generate 
30 observations and run ARiS for 77 = 0. Figure 2 clearly demonstrates how the 
shrinkage proceeds throughout our algorithm. The constrained region eventually 
becomes singular along the dimension which has no contribution to the response in 
the underlying model. ARiS iteratively updates the constrained region and converges 
to a solution. 

3.2 The Marginal Posterior Mode of (3 via EM 

An expectation-maximization approach may be used to obtain the marginal posterior 
mode of (3. Consider the identity 

p{a 2 ,v x \y,p) 

Taking the logarithm and then taking the expectations of both sides with respect to 
p (a 2 ,v~ 1 |/3«) yields 

logp(/3|y) = logpt^er^v^ly) - logp(a 2 , v _1 |y, /3) 

= J \ogp{(3,a 2 ^ l \y)p{a 2 ^ 1 \^)da 2 d^ 1 

-J \ogp(a 2 ,^ 1 \y,(3)p{ ( r 2 ,^ 1 \^)da 2 d^-\ (20) 

where f3^ 1 ' is the current guess of (3 (Sorensen and Gianola, 2002, pg. 446). The EM 
algorithm involves working with the first term of (20). The EM procedure in our case 
would consist of the following two steps: (i) expectation of logp(/3, a 2 , v _1 |y) with 
respect to p (a 2 , v~ 1 \f3^) , (ii) maximization of the expected value with respect to (3. 
An iterative procedure results by replacing the initial guess (3^ with the solution of 
the maximization procedure /3^ +1 - ) and repeating (i) and (ii) until convergence. 
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With a slight change in the hierarchical model used, the above expectation will 
become quite trivial. Unlike (2), let us not condition the prior density of (3 on a 2 . 
Under such a setup, p (<r 2 , v -1 |y, = p (a 2 |y, /3^) p (v _1 |y, /3^). Notice that the 
conditional posteriors in (9) and (11) now become 



, 2,« n f 2x- (2+ i) / (y - X/3)'(y - X/3) | 
p(<7 |/3,y)oc(a) ^+>exp^ — > 



(21) 



p^lAy,^) oc (vj^^exp {- (f - + ^} • (22) 

Given the new prior, let us re-write (6) in the log form excluding the terms that do 
not depend on (3, a 2 and v _1 : 

p 

\ogp(f3,a 2 ^~ 1 \y,H) oc -(n/2 + 1) log(a 2 ) + (1/2 + rj) ^log^" 1 

j'=i 

2cx 2 2 



Next we compute E v -ii y ^(oE^iy ]/3 (o logp ((3, a , v |y, if) as /i — > 0: 



(n/2 + l)E ff2|y>/ 3 (i) log(a 2 ) + (1/2 + 77)E v _ 1|yi/3(!) ^ log^" 1 



(y - X/3)' (y - X/3) 3 ' 

2S 2 ^/n 2 J ^/? 2(0 



(24) 



where S 2 « = (y - X/3«)' (y - X/3^). 

Having completed the expectation step, the maximization of (24) with respect to 
(3 yields an estimator as the solution to the following sequence of convex minimization 
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problems: 



p 



01 



/3( /+1 ) = arg min (y - X/3)' (y - X/3) + (2r? + 3) 



(25) 



Hence, the marginal posterior mode of /3 is extremely similar in form to the joint 
posterior mode. Note that we can still adopt the weighting perspective mentioned 



(5j results in a univariate t-density with degrees of freedom 2r] + 2. Therefore, a value 
of t] = —3/2 will actually lead to a flat prior over j3j resulting in the OLS estimator 
(note that when rj = —3/2, the kernel of the t density has power resulting in a flat 
density). Also, the solution to the marginal when rj = — 1 will be identical to the 
maximization of the joint posterior when i] = 0. 

Harville (1977) mentions that the estimator of Lindley and Smith (1972) based 
on joint maximization may be far from the Bayes estimator and suggests that the 
maximization of the marginal mode of the variance components would be a superior 
approach. Above we have shown that in our case the mode of the marginal density has 
the same form as the joint mode justifying the conditional maximization approach. 
In fact, Tipping adopts the approach suggested by Harville, maximizing the joint 
posterior mode of v _1 and a 2 after integrating over (3 but still achieves the zeroing 
effect. 

We needed a slight change in the model to ease our work for the expectation step 
above, that is, we made the prior distribution of the regression coefficients indepen- 
dent of the error variance. One may think that while we were trying to show the 
equivalence of these two solutions (the joint and the marginal solutions), we actually 
created two different models and show that their solutions are identical in form, yet, 
they do not follow the same model. In the traditional Bayesian analysis of the linear 



earlier. Recall from Section 2 that the integration over u • 1 in the prior distribution of 



13 



regression models, the regression coefficients are conditioned over the noise variance. 
This provides an estimator for the regression coefficients that is independent of the 
noise variance. We followed this convention when we were forming our hierarchical 
model. However, in our case, there is no such thing as independence between the 
solutions of the regression coefficients and the noise variance. Although in an explicit 
statement such as (12) it may seem that the solution for the regression coefficients 
does not depend on the error variance, there exists an implicit dependence through 
the solution of s. We could have very well constructed our hierarchical model 
using a prior on regression coefficients independent of the noise variance. This would 
lead to a solution that is only slightly different. The re-estimation equation in (12), 
(13) and (14) would become 

(3 = (X'X + ^V-y'x'y (26) 
?2 = (y - X/3)'(y - Xg) 

72 + 2 

= Zl 1 j = 1,2,..., p. 28 

Now, the solution for vj is independent of a 2 , yet the solution of (3 explicitly depends 
on a 2 . That said, the implicit dependence has become an explicit one. Thus the 
iterative solution for the regression coefficients as n — > can be written as 

(3^ = arg min (y - X/3)' (y - X/3) + (2rj + 1) £ J} (29) 

in which, apart from the tuning quantity (2r/ + 1), the only difference with (25) is the 
plug-in estimator used for the noise variance. 

Having shown that these procedures are fundamentally identical in form to each 
other, let us discuss another important point, the choice of initial values to start the 
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algorithm. Let us consider the solution following (25) with rj 



-1: 



/ 



1\ 



-i 



i 








X'X + 



x'y 



(30) 



V 







We cannot just plug any (3^ as an initial estimator. Consider (3^ = 0. In this case 
all the regression coefficients will be zeroed. Or let only a subset of f3^ ' be zero. Then 
in the solution those components will remain zero. Although we are solving a series 
of simple convex problems, the dependency of the solution to the initial value proves 
that here we are dealing with a multi-modal objective function as would be expected. 
Thus using the OLS estimator as an initial value will take us to a local stationary 
point which is most likely under the support of the data in hand. 

To gain further intuition, let us consider an orthogonal case and let the predictors 
be scaled so that they have unit 2-norm, i.e. X'X = I. In such a case the OLS 
estimator for f3 would have a variance-covariance matrix a 2 I where a 2 is a plug-in 
estimator, e.g. the maximum likelihood estimator or the bias corrected estimator. 
Testing the null hypothesis H : j3j = 0, a t-statistic can be computed for a compo- 
nent (3j as (3j/a 2 . Notice in (30) the quantities at the diagonal of the second piece 
under the matrix inverse operation, S 2 ^ /n(3^ l \ resemble the inverse of a squared t- 
statistic. In fact, recall from Section 3.1 that we formed a sequence of ridge regression 
problems out of this procedure by re-weighting our predictors by \f3j l /a®\ which is 
the absolute value of a t-statistic. Thus, following a conventional testing procedure, 
those predictors which correspond to coefficients with larger t-statistics will be given 
more importance. This is yet another point that intuitively explains our procedure. 
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4 Approximating the marginal likelihood 

Critical to the ARiS procedure is the choice of the hyper-parameter 77. We propose an 
empirical Bayes estimation of 77 through the maximization of the marginal likelihood 
p{y\f])- Hence, we must integrate the joint posterior over all parameters, 



where 6 = (/3, a 2 , v" 1 )'. In the case of the hierarchical model developed in Section 2, 
the direct calculation is intractable. Below we propose both analytical and simulation- 
based approximations. 

4.1 Laplace approximation 

A standard analytical approximation of the marginal likelihood can be computed 
using the Laplacian method (Tierney and Kadane, 1986). The approximation is 
obtained as 



where 9 is the mode of the joint posterior and Hg is the Hessian matrix given in B 
evaluated at the posterior mode. 

Recall that the ARiS is designed to drive the values of j3j and Vj to zero for those 
independent variables Xj which provide no explanatory value. As \i — > 0, the prior 
precisions and the regression coefficients related to irrelevant independent variables 
will tend toward 00 and respectively. In fact we can see in (40) that along these /3j 
the curvature approaches 00 as we converge to the solution thus driving their variance 
to 0. At the joint posterior mode, the corresponding dimensions of X do not contribute 




(31) 




(32) 
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and become irrelevant. Under the support of the data, we claim these variables to 
be insignificant and suggest their removal from the model. The integration follows 
removal of these irrelevant variables from the model. 

The resulting Laplace approximation to the log-marginal likelihood is 



logp(y|?7) « \ogp((3\a 2 ,v\y\H) + — log(27r) 



log 



H 



(33) 



where represents the reduced model after the removal of the irrelevant variables 
at the mode. 



4.2 Numerical integration 

Laplace approximation may not perform well in certain will be seen in Section 

5. (3 and a 2 can be analytically integrated out of the joint posterior given in Eq. 6. 
The resulting likelihood conditioned on the prior variances is, 

Kylv- 1 ) oc |X'X + V- 1 !" 172 |VP 1/2 (A + S 2 y {n+u)/2 , (34) 

where 

S 2 = y'y - y'X (X'X + V" 1 ) ^ X'y; (35) 

see also (Chipman et al., 2001, Equations 3.11,3.12). The marginal likelihood con- 
ditioned over rj can now be obtained through integration as p(y\rj) = E^-iu [p(y|v)] 
where the expectation is taken over the prior distribution of v _1 . 

In order to ensure efficient sampling, we define a hypercube around the mode of 
the joint posterior in order to obtain a sampling region over ]X .^vj . The sampling 
region is the set |w~ 1 |max(0, v.j~ l — ka vT i) < vj 1 < vf x + ka v .\ where v,~ x is the 
modal value of v 7 1 , a -i is the square root of inverse curvature at the mode and k is 
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to be chosen to adjust the width of the box. 

5 Examples 

This section reports the results of a simulation study comparing the ARiS estimates 
with a number of computationally efficient penalized least squares methods. In the 
study we consider a model of the form y = X/3 + A/"(0, a 2 ). For each data set, we 
center y and scale the columns of X so that they have unit 2-norm. The lasso and 
elastic net were fit using the lars and elasticnet libraries in R. 

Model 0: This model is adopted from Zou (2006) and is a special case where 
the lasso estimate fails to improve asymptotically. The true regression coefficients 
are {3 = (5.6,5.6,5.6,0). The predictors Xj (i = 1, ...,n) are iid A/"(0, C) where C 
is defined in Zou (2006) (Corollary 1, pg. 1420) with pi = —.39 and p 2 = .23. 
Under this scenario, C does not allow consistent lasso selection. In this context Zou 
(2006) proposes the adaptive lasso (adalasso) for consistent model selection. In this 
setting, we simulate 1000 data sets from the above model for different combinations 
of sample size and error variance. Table 1 reports the proportion of the cases where 
the solution paths included the true model for ARiS, lasso and adaptive lasso. We 
also report the results of ARiS in the special case when 77 = 0. The results indicate 
that the ARiS algorithm performs nearly as well as the adaptive lasso and far better 
than the ordinary lasso in terms of consistent model selection under this particular 
setting. For rj = 0, the ARiS produces a consistent estimate and does not require 
a search over the solution path. For medium and large values of n we can see that 
it significantly outperforms the lasso. Results for the lasso and adalasso agree with 
those of Zou (2006). 

We next compare prediction accuracy and model selection consistency using the 
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following three models which are drawn from Tibshirani (1996). 

Model 1 : In this example, we let (3 = (3, 1.5, 0, 0, 2, 0, 0, 0)' with iid normal pre- 
dictors Xj (i = 1, ...,n). The pairwise correlation between the predictors Xj and x^ 
are adjusted to be (.S)'- 7- ^. 

Model 2: We use the same setup as model 1 with j3j = 0.85 for all j. 

Model 3: We use the same setup as model 1 with /3 = (5, 0, 0, 0, 0, 0, 0, 0)'. 

We test models 1,2, and 3 for two different sample sizes, n = 20, 100 and two noise 
levels a = 3, 6. This experiment is conducted 100 times under each setting. In Table 
4, we report the median prediction error (MSE) on a test set of 10,000 observation for 
each of the 100 cases. The values in the parentheses give the bootstrap standard error 
of the median MSE values obtained. C, I and CM respectively stand for the number of 
correct predictors chosen, number of incorrect predictors chosen and the proportion of 
cases (out of 100) where the correct model was found by the method. The bootstrap 
standard error was calculated by generating 500 bootstrap samples from 100 MSE 
values, finding the median MSE for each case, and then calculating the standard 
error of the median MSE. Lasso, adalasso, elastic net, nonnegative garrote, ridge and 
ordinary least squares estimates are computed along with the ARiS estimate. For 
the ridge estimator, the ridge parameter is determined by a GCV (generalized cross- 
validation) type statistic, while for all the others we use 10-fold cross-validation for 
the choice of the tuning parameters. We also consider the lasso where the tuning 
parameter is chosen by the method of Yuan and Lin (2005). ARiS hyper-parameter 
rj is determined both by the Laplace approximation and the numerical integration to 
the marginal likelihood. We also report the results for the particular case of r\ = 0. 
In each example the numerical integration step of ARiS-eB is carried out for values 
of k — 3, 10, 100, 1000 and only the best result is reported. This is a rather arbitrary 
choice and will depend upon the number of samples drawn. Model 3 is the only 
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example where the same value of k is consistently chosen (k = 1000). 

Model 3 demonstrates the most striking feature of the ARiS algorithm, the ability 
to identify the correct model under sparse setups. When utilized along with the 
empirical Bayes step, it is able to identify the correct model in a very large proportion 
of cases with very low prediction error. This is especially surprising for the cases where 
n = 20 (cr = 3, 6). In the case where n = 100 and r] = the algorithm still outperform 
all other methods in terms of correct model choice and MSE. 

Among all the variants of lasso (lasso, adalasso, elastic net, lasso (CML)), lasso(CML) 
is optimal in terms of prediction accuracy. In the case of n = 100, its prediction error 
is almost identical to that of ARiS(r/ = 0) but correct model identification is strongly 
weaker. Observe that a cross-validation approach may not accurately choose the 
tuning parameter for the lasso-variants. For example, as we moved from n = 20 to 
n = 100, the proportion of cases where the correct model was chosen decreased for 
all the lasso- variants except lasso(CML) where the tuning parameter is chosen via an 
empirical Bayes step similar to our approach. The nonnegative garrote estimator per- 
forms quite poorly in this situation along with the ridge and ols estimators. Results 
indicate that ARiS provides superior performance for model 3. 

In the case of model 1, for n = 100 and a = 3, ARiS performs best in terms 
of prediction accuracy and strongly outperforms other algorithms in terms of model 
selection accuracy. ARiS (77 = 0) outperformed other versions which required a search 
over the solution path. Both the Laplace approximation and the numerical integration 
fail to detect this value of r]. For the case of n = 20, ARiS performs within a standard 
error of all the other estimators in terms of prediction accuracy, yet does better in 
terms of model selection accuracy. The ridge estimator does almost as well as the 
lasso-variants in terms of prediction accuracy. Similar results follow for the case 
n = 100, a = 6. However, elastic net seems to have slightly lower prediction error. 
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The case n = 20, a = 6 shows fairly weak results across all estimators. 

Model 2 demonstrates the biggest weakness of the ARiS and several other esti- 
mators. When there are many small effects present in the underlying model, these 
estimators do not perform well since they favor parsimony. For all cases the clear 
winner is the ridge estimator. 

6 Conclusion 

We have introduced a Bayesian model fitting and variable selection method, ARiS, 
which makes use of a hierarchical model and enforces parsimony. The method com- 
bines an efficient optimization procedure which is tailored to the fully conditional 
posterior densities with various techniques to derive and maximize the marginal like- 
lihood. This development, although radically different in detail, is similar in spirit 
to modern implementation of the lasso which has been described as a Bayesian pro- 
cedure which combines a normal likelihood with a Laplace prior on the regression 
coefficients; see Tibshirani (1996). 

Considering the simulation results of Section 5, we note two key features of the 
ARiS: (i) its superior prediction and model selection accuracy when the underlying 
model is sparse, and (ii) the significant improvement in performance accompanying 
an increase in the sample size indicating asymptotic consistency. 

Computationally, for a specific t] value, ARiS requires one matrix inversion at each 
iteration. This point is obvious from the description of the method as a series of ridge 
regressions. Thus the computational cost for each iteration of ARiS is at most 0(p 3 ). 
In practice, because variables are eliminated throughout the procedure, the cost often 
decreases dramatically with each iteration. Our experiments indicate fast convergence 
of this procedure across sample sizes. Lasso methods offer a computational advantage 
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due to the lars algorithm (Efron et al., 2004) which can compute the entire solution 
path of the lasso with the cost of a single OLS estimator. However, our experimental 
results indicate that these methods are often inferior in terms of model selection and 
prediction accuracy. 

Large scale experiments have shown that the procedure remains computationally 
feasible in situations where the number of predictor variables is very large. Hence the 
proposed method offers the most advantage in problems where one is attempting to 
select a small or moderate number of variables from a large initial group, a common 
situation in many modern statistical and data mining applications. An open issue is 
the empirical Bayes step via numerical integration. Due to the large scale simulations 
throughout our experiments we have only drawn 1000 samples for the integration 
of v" 1 . Obviously in practice a much larger set of samples could be drawn at little 
additional cost particularly for sparse models. The process will become more stable 
as we draw larger samples. In such a case, the choice of k may just be fixed at a 
larger value, i.e. k = 1000. 
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A Appendix 

Proof of Theorem 1. f3 and a 2 can tractably be integrated out of (6). As a result of 
this integration, the only remaining terms that are dependent upon vj l are 

|X'X + V- 1 ^ 172 |Vf 1/2 (y'y - y'X (X'X + V- 1 )- 1 X'y)" n/2 n^" 1 ). (36) 

i=i 

It will suffice to show that (36) is finitely integrable with respect to vj . 

|x'x + y-^-^W' 112 = ix'xv + 1|~ 1/2 < ix'xv]- 1 / 2 = l^x]- 1 / 2 ^- 1 / 2 , (37) 

and 

y'X (X'X + V" 1 ) _1 X'y < y'X (X'X) -1 X'y. (38) 
Eliminating the terms again that are not dependent upon Vj, we reduce (36) to 



i=i 

Integrating (39) is equivalent to Y\ P j=i ^7 ' This expectation is taken with respect 
to (4) and is finite for \i > 0. □ 
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B Appendix 

Let = (/3, a 2 , v -1 )'. The negative Hessian, H , is given by 



O 2 



d 2 _ 

2\2 



(da 



logp(y, 
logp (y, 6 



d 2 

T^-— logp(y,0 



dv k dv m 







———\ogp(y,e 



2 



da 2 d(3 k 
d 2 



logp(y, 



da 2 dvl 



r logp(y, 6 



V) 

v) 
v) 

v) 

v) 
v) 



(Er=i x ik + v k X ) , k = m 



1 2A* 

- + -IT 



a' 



(7 



(| + 77), k 



m 



0. 



k ^ m 



%, k = m 
0, k ^ m 



a' 



i=l 



1% 



2a 4 ' 



(40) 
(41) 
(42) 

(43) 

(44) 
(45) 



References 

Berger, J. O. Statistical Decision Theory and Bayesian Analysis. New York: Springer 
(1985). 

Breiman, L. "Better Subset Regression Using the Nonnegative Garrote." Technomet- 
rics, 37(4):373-384 (1995). 

Candes, E. and Tao, T. "The Dantzig Selector: Statistical Estimation When p is 
Much Larger Than n." The Annals of Statistics, 35(6):2313-2351 (2007). 



24 



Chen, M.-H., Shao, Q.-M., and Ibrahim, J. G. Monte Carlo Methods in Bayesian 
Computation. Springer (2001). 

Chipman, H., George, E. I., and McCulloch, R. E. "The Practical Implementation of 
Bayesian Model Selection." IMS Lecture Notes - Monograph Series, 38 (2001). 

Drucker, H., Burges, C. J. C, Kaufman, L., Smola, A., and Vapnik, V. "Support 
Vector Regression Machines." In Mozer, M. C, Jordan, M. I., and Petsche, T. 
(eds.), Advances in Neural Information Processing Systems, volume 9, 155. The 
MIT Press (1997). 

D'Souza, A., Vijayakumar, S., and Schaal, S. "The Bayesian Backfitting Relevance 
Vector Machine." In ICML '04: Proceedings of the twenty-first international con- 
ference on Machine learning, 31 (2004). 

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. "Least Angle Regression." 
The Annals of Statistics, 32(2):407-499 (2004). 

George, E. I. "The Variable Selection Problem." Journal of the American Statistical 
Association, 95 (2000). 

Harville, D. A. "Maximum Likelihood Approaches to Variance Component Estima- 
tion and to Related Problems." Journal of the American Statistical Association, 
72(358):320-338 (1977). 

Hoerl, A. E. and Kennard, R. W. "Ridge Regression: Biased Estimation for 
Nonorthogonal Problems." Technometrics , 42(l):80-86 (2000). 

Lindley, D. V. and Smith, A. F. M. "Bayes Estimates for the Linear Model." Journal 
of the Royal Statistical Society. Series B (Methodological), 34 (1972). 



25 



Sorensen, D. and Gianola, D. Likelihood, Bayesian, and MCMC Methods in Quanti- 
tative Genetics. New York: Springer (2002). 

Tibshirani, R. "Regression Shrinkage and Selection via the Lasso." Journal of the 
Royal Statistical Society. Series B (Methodological) , 58(l):267-288 (1996). 

Tierney, L. and Kadane, J. B. "Accurate Approximations for Posterior Moments and 
Marginal Densities." Journal of the American Statistical Association, 81(393):82- 
86 (1986). 

Tipping, M. E. "Sparse Bayesian Learning and the Relevance Vector Machine." 
Journal of Machine Learning Research, 1 (2001). 

Tipping, M. E. and Lawrence, N. D. "Variational inference for Student-t models: Ro- 
bust Bayesian interpolation and generalised component analysis." Neurocomputing , 
69 (2005). 

Yuan, M. and Lin, Y. "Efficient Empirical Bayes Variable Selection and Estimation 
in Linear Models." Journal of the American Statistical Association, 100(472):1215- 
1225 (2005). 

Zou, H. "The Adaptive Lasso and Its Oracle Properties." Journal of the American 
Statistical Association, 101:1418-1429 (2006). 

Zou, H. and Hastie, T. "Regularization and variable selection via the elastic net." 
Journal Of The Royal Statistical Society Series B, 67(2):301-320 (2005). 



26 



Tables 



Table 1: Results for model 0. 



n = 60,a = 9 n = 120,(J = 5 n = 300, a = 3 
Lasso 0.499 0.489 0.498 
AdaLasso 0.724 0.935 0.996 
ARiS 0.671 0.927 1 
ARiS(rj = 0) 0.462 0.895 0.944 
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Table 2: Results for model 1. 
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Table 3: Results for model 2. 
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Table 4: Results for model 3. 
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Figure Captions 



Figure 1. Contours of the penalty imposed by the independent (log) t-priors for \i — 1 
and fi = 0.001. 

Figure 2. Visualization of the ARiS algorithm. Here the solid lines show the contours 
of the least squares function and the dashed lines show the adapted constraint region 
for each iteration. 
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Figures 




(a) (b) 

Figure 1: Contours of the penalty imposed by the independent (log) t-priors for \i — 1 
and fi = 0.001. 





Figure 2: Visualization of the ARiS algorithm. Here the solid lines show the contours 
of the least squares function and the dashed lines show the adapted constraint region 
for each iteration. 
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